Mail Filtering with Procmail

This document is meant as a gentle introduction to the use of Procmail. It was last updated on 22 September 1997, but it's been some time since I've added groundbreaking new content. If you have suggestions for new material, drop me an email!

NOTE: (9/30/2003) It's hard to believe, but I still get Procmail questions from readers of this tutorial, after six years! At this point, most of the questions I get are about how to filter spam with Procmail. I've included some links on this issue at the end of the tutorial.
This file is currently maintained by Ian Soboroff. I can be reached at ian@umbc.edu. Please feel free to mail me concerning questions, additions, or corrections.

Since there seems to be a bit of confusion on this point, I didn't write Procmail. I wish I had, since it's a fine piece of software, but I only wrote this tutorial. I am not a source for Procmail software distributions, or for help on compiling or installing it. You can get the software via FTP, and you should find manual pages and such with the distribution there. Once you've got Procmail set up (or confirmed that it is already set up at your site!), come on back and follow this tutorial.



What is it?

Procmail is a program for filtering electronic mail. It is very useful for presorting and preprocessing large amounts of incoming mail. You can use it to sort out mail from mailing lists, to dispose of junk mail, to send automatic replies, or even to run a mailing list.

You control Procmail yourself, through a file that you put in your home directory. This Web page will guide you through the complexities of writing this file.

This web page is meant to cover the basics. First, I'll walk through a sample filter setup. After that, I'll build a set of filters from scratch as a tutorial. This should be sufficient to get you up and running, using most of Procmail's normal features.

Procmail has several manual pages (online help); their titles and how to read them are discussed at the end. I've also included links to a couple of spam-filtering utilities; for more, you might try Googling for "spam filter procmail".


Getting Started

This document is geared towards using Procmail on the student and faculty systems at UMBC (i.e., UMBC7, 8, or 9, or the general-use workstations). If you're planning on using Procmail on a different system, you should consult your system administrator. You can find a link to the Procmail ftp site at the end of this file.

Currently, these UMBC systems are already running Procmail. All you need to do is compose a special file, called .procmailrc (don't forget that leading dot!), which describes the sorting criteria. Once you have this file in your home directory ($HOME), Procmail will automatically be run on any incoming mail you receive.

Side Note -- a bit of Unix trickery

Files in Unix that begin with a dot '.' are hidden files. So, when you use the ls command to view the files in your home directory, you may not see the .procmailrc file, or any other so-called "dot-files", right away. To see hidden files in your directory, use the '-a' option, as in ls -a. The '-a' stands for "all files," and will show you both hidden and visible files in one listing.
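
If you'd like to see the difference for yourself without touching your real home directory, here is a small sketch. The temporary directory and the file notes.txt are made up for the demonstration; only the two ls commands come from the text above.

```shell
# Work in a throwaway directory so nothing in your real home directory is touched.
demo_dir=$(mktemp -d)
touch "$demo_dir/.procmailrc" "$demo_dir/notes.txt"

plain_listing=$(ls "$demo_dir")     # dot-files are left out
full_listing=$(ls -a "$demo_dir")   # -a adds dot-files, plus the . and .. entries

echo "ls shows:"
echo "$plain_listing"
echo "ls -a shows:"
echo "$full_listing"
```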

The whole trick to Procmail is writing the .procmailrc file. However, to the beginner, the format may look like some magical incantation, so I'll start with a small example (actually, an excerpt from my personal .procmailrc!) and walk through it. This is going to entail discussion of a lot of particulars and details, but don't worry; if things seem to digress or just plain stop making sense, odds are they'll be explained more fully later. After that, I'll construct a new .procmailrc, tutorial-style.

Sample .procmailrc:

# .procmailrc
# routes incoming mail to appropriate mailboxes
PATH=/bin:/usr/bin:/usr/local/bin
MAILDIR=$HOME/.mailspool   # all mailboxes are in .mailspool/
DEFAULT=$HOME/.mailspool/ian
LOGFILE=/dev/null
SHELL=/bin/sh

# Put mail from DC-Linux mailing list into mailbox dclinux
:0:
* ^(From|Cc|To).*dc-linux
dclinux

Now, don't panic, it's not as bad as it looks. A .procmailrc has two parts, assignments and recipes. The assignments set up variables so that Procmail knows where programs and mailboxes are; that's the top part. The recipe is the incantation at the bottom. Anything preceded by a hash mark (#) is a comment, and is ignored.

Assignments

The assignments section tells Procmail where to find things, such as your mailboxes, or programs that it might need to run. The set of assignments above pretty much covers what most users should need; the full set is discussed in the procmailrc man page.

Here are descriptions of the assignments in the excerpt above. They take the format variable-name = value.

PATH=/bin:/usr/bin:/usr/local/bin
This tells Procmail where to look for other programs. Here at UMBC, the procmail programs are in /usr/local/bin. I also have /bin and /usr/bin here, for programs that might be run from there. Directories are separated by colons.
MAILDIR=$HOME/.mailspool   # all mailboxes are in .mailspool/
MAILDIR is the location of your mailboxes. $HOME stands for your home directory; therefore, this points to the .mailspool directory in my home directory. Most users have this directory, containing their incoming-mail folder; however, you should probably double-check that the directory exists.
DEFAULT=$HOME/.mailspool/ian
DEFAULT is the default place for Procmail to put your mail; this should be your regular incoming mailbox. For me and most users, this is $HOME/.mailspool/ian (except with your username); but it could be something else, such as $HOME/.Maildrop.
LOGFILE=/dev/null
This specifies the name of a file to use as a usage log, to which Procmail will write any diagnostic or error messages during its activity. In this example, logging is being done to /dev/null, which is kind of a system black-hole for dumping unwanted data; no logs will be kept by this .procmailrc file. As an example, the following LOGFILE assignment creates daily logs with the appropriate date, using the UNIX date command:
LOGFILE=$MAILDIR/log.`date +%y-%m-%d`
These logs would go in the same place as the mailboxes, specified above in the MAILDIR assignment. Logs can be useful for tracking down errors; use them with new recipes, then delete them when you know they work.
SHELL=/bin/sh
This defines a shell, or operating environment, for Procmail to run other commands in.
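
To see what the dated LOGFILE name would expand to, you can try the same date command at a shell prompt. This is only a sketch: the MAILDIR value below repeats the one from my example, and yours may differ.

```shell
# Stand-in value so the sketch runs anywhere; in .procmailrc this comes from the MAILDIR line.
MAILDIR="$HOME/.mailspool"

# $(...) is the modern spelling of the backquotes used in the LOGFILE assignment.
LOGFILE="$MAILDIR/log.$(date +%y-%m-%d)"
echo "$LOGFILE"
```

The name ends in a two-digit year, month, and day, so each day's messages land in their own log file.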

Recipes

OK, with that boring stuff out of the way, we can get on to the interesting part -- recipes. Recipes are where the real work of filtering is done. Things can get kind of complex here, but bear with the details for now... the tutorial afterwards should clear up any remaining fuzziness.

Recipes have the following format:

:0 [flags] [: [lock-file] ]
zero or more conditions
one action line
The flags and lock-file business I'll cover later. The idea is that if the conditions are met, the action is performed. Now, let's look again at the simple recipe from above, which filters my mail from a DC-area Linux users group into its own mailbox:
# Put mail from DC-Linux mailing list into mailbox dclinux
:0:
* ^(From|Cc|To).*dc-linux
dclinux
The action line in this case is simple: it's just 'dclinux,' the name of the folder to put the mail into. The action could also be an address to forward the mail to, or a program to start, or even a block of commands. We'll see more complex examples later.

A condition tells Procmail what to look for in a mail message. Each condition begins with a '*', and the rest of the line is a pattern to look for. If part of the message matches this pattern (by default, Procmail looks only at the header), then Procmail will apply the action. The pattern is called a regular expression, and takes some explaining. Before I dive in, here is what this pattern means:

at the beginning of a line, 'From' or 'Cc' or 'To', followed by some number of characters, followed by 'dc-linux'.
Thus, this pattern would match messages with 'dc-linux' in the From, Cc, or To lines of the header. Neat, huh?
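
You can try the pattern out at a shell prompt before trusting it with real mail. The sketch below uses grep -E -i as a stand-in for Procmail, on the assumption that the two treat a simple pattern like this one the same way; the header lines and addresses are made up.

```shell
headers='From: Some Person <someone@example.org>
To: dc-linux@example.org
Subject: meeting tonight'

# -E selects egrep-style patterns; -i ignores case, as Procmail does by default.
matched=$(printf '%s\n' "$headers" | grep -E -i '^(From|Cc|To).*dc-linux')
echo "$matched"
```

Only the To line is printed, because it is the only From, Cc, or To line with 'dc-linux' in it.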

Most .procmailrc files have more than one recipe. The rule is, unless you tell it otherwise, Procmail will stop at the first recipe that matches the message. I'll show how to get around this in the tutorial.

Regular Expressions

Regular expressions are actually reasonably simple, once you get the hang of them.

First and foremost, any character that isn't a special character mentioned below matches itself. This includes all letters and numbers, and some punctuation. That is to say, the regular expression

	Bob
matches the string "Bob". In Procmail, regular expressions are case insensitive, so this will also match "bob", or "bOb", or "BOB", for that matter.

A dot '.' matches any character except a newline. So, the expression

	.ob Jones
will match the string "Bob Jones", but also "Rob Jones" and "Qob Jones", too.

Any character followed by a star '*' matches that character repeated 0 or more times. Thus,

	Bob* Jones
matches "Bo Jones", "Bob Jones", or "Bobbbbbbbbbb Jones". The expression ".*" matches any number of unspecified characters.

Related are the '+' and '?' modifiers. The expression "a+" matches one or more a's. The expression "a?" matches zero or one a.

You can use parentheses to group an expression for use with a modifier. So, the expression

	B(ob)+
matches "Bob", and also "Bobobobobobob".

If one character in a pattern could be one of several, you can use a character class. For example:

	Part [abcd]
matches "Part a", "Part b", "Part c", and "Part d". If the first character of a class is '^', the class matches anything _not_ in the class. For example:
	[^aeiou]+
matches any series of one or more non-vowel characters.

One more operator is the '|' (vertical-bar) character. It is used to match either of two expressions. For example:

	Bob|Joe
will match "Bob" or "Joe".

The last two special characters I want to mention are '^' and '$'. Incidentally, here I'm referring to a '^' that isn't inside a character class. '^' means the beginning of a line, and '$' means the end of one. So,

	^To:
would match the letters 'To:' at the beginning of a line. If that looks suspiciously like part of a mail header, consider it a preview. ;-)

This comprises most of the special characters that Procmail uses in regular expressions. There are a few others, but the manual pages for egrep and procmailrc explain them as well, and if I'm not careful this will turn into a help sheet on regular expressions!
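
If you want to experiment, the sketch below runs the patterns from this section through grep -E -i, which I'm assuming behaves like Procmail for patterns this simple. The helper functions matches and report are made up for this example.

```shell
# matches PATTERN STRING: succeeds when STRING contains a match for PATTERN.
# grep -E -i stands in for Procmail here (egrep-style syntax, case ignored).
matches() {
    printf '%s\n' "$2" | grep -E -i -q -e "$1"
}

# report PATTERN STRING: say in words whether the pattern matched.
report() {
    if matches "$1" "$2"; then
        echo "'$1' matches '$2'"
    else
        echo "'$1' does not match '$2'"
    fi
}

report 'Bob'         'bOb'
report '.ob Jones'   'Qob Jones'
report 'Bob* Jones'  'Bo Jones'
report 'B(ob)+'      'Bobobob'
report 'B(ob)+'      'B'
report 'Part [abcd]' 'Part e'
report '[^aeiou]+'   'aeiou'
report 'Bob|Joe'     'Joe'
report '^To:'        'To: ian'
report '^To:'        'Reply-To: ian'
```

The lines worth a second look are the failures: 'B' alone has no 'ob' to repeat, 'Part e' is outside the character class, 'aeiou' contains nothing but vowels, and 'Reply-To:' has 'To:' in the middle of the line rather than at the start.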

Now, what is all this about matching, anyway? Well, now you should be able to see that a regular expression represents a pattern that can occur in a mail message. We will use regular expressions to tell Procmail what patterns to look for. Next, I'll walk through the construction of several recipes, and you'll see how it's done.


Recipe Concoction Tutorial

Now, let's construct a .procmailrc file as we might in real life. Hopefully this will make a lot of the cluttered details up above make a little more sense. For other examples, read the procmailex manual page.

We're going to use the same assignments section as described above. Unless you have your mailbox in an odd place, or want to use logs, you'll probably find what I've included to be just fine.

Let's say we belong (as I do) to the mailing list Israeline, which sends out daily news clipping collections from Israeli news sources. It might be nice to have these digests automatically placed in a special mail folder, which we'll call 'israel'.

Mail from this list comes addressed like so:

To: Multiple recipients of list <israeline@nysernet.org>
This has changed in the past, but it always has that address in it, so we'll use part of that as our pattern. Our pattern will be to match "a line starting with 'To:' and containing 'israeline'", or ^To:.*israeline. The recipe will look like this:
:0:                # the last colon means use a lockfile
* ^To:.*israeline
israel             # put these messages in the 'israel' folder
One thing to remember, by the way, is don't put any comments on a condition line. If you do, Procmail will think the comment is part of the pattern!
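
Here is a rough illustration of why a stray comment hurts. It again uses grep -E -i as a stand-in for Procmail, and the comment text is made up; the point is simply that the extra characters become part of what has to match.

```shell
header='To: Multiple recipients of list <israeline@nysernet.org>'

clean_pattern='^To:.*israeline'
commented_pattern='^To:.*israeline   # israeline digests'

# check PATTERN: print "match" or "no match" for the header above.
check() {
    if printf '%s\n' "$header" | grep -E -i -q -e "$1"; then
        echo "match"
    else
        echo "no match"
    fi
}

clean_result=$(check "$clean_pattern")
commented_result=$(check "$commented_pattern")
echo "without comment: $clean_result"
echo "with comment:    $commented_result"
```

The second pattern fails because no header line contains the text '# israeline digests'.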

Ok, now what's all this about a 'lockfile'? Well, suppose two israeline messages came in at about the same time. It's very possible that the mail system would fire up two copies of Procmail, and each would try to write its message to your 'israel' folder! By using a lockfile, the first Procmail that gets run will 'lock' the folder so only it can write to it; any other Procmail trying to write to that folder will have to wait until the first is finished. Using lockfiles may slow down your mail delivery ever so slightly, but it's better than mangled mail.
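
The idea behind a lockfile can be sketched in a few lines of shell. This is not how Procmail itself is implemented, just an illustration of the principle: whoever manages to create the lock file first may write, and everyone else has to wait. The function names and the temporary directory are made up.

```shell
lock_dir=$(mktemp -d)
lockfile="$lock_dir/israel.lock"

# try_lock: create the lock file only if it does not exist yet.
# With noclobber set, ">" refuses to overwrite, so only one caller can win.
try_lock() {
    ( set -o noclobber; : > "$lockfile" ) 2>/dev/null
}

# unlock: release the lock so the next writer can take it.
unlock() {
    rm -f "$lockfile"
}

if try_lock; then first="got lock"; else first="must wait"; fi
if try_lock; then second="got lock"; else second="must wait"; fi
unlock
if try_lock; then third="got lock"; else third="must wait"; fi
unlock

echo "first writer:  $first"
echo "second writer: $second"
echo "after release: $third"
```

The second attempt has to wait because the first one still holds the lock; once it is released, the next attempt succeeds.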

Now, suppose your colleague Bob likes to send you lists of jokes that he finds around the Net every so often, usually with "joke" or "funny" in the subject line. We don't want this frivolity cluttering our otherwise clean, businesslike work mailbox, so we'll forward it to our account at the university. The tricky part is we want to make sure we don't forward Bob's vital business memos too. We'll use two conditions in the recipe; one to match mail from Bob, and one to match the subject. Here's how the recipe looks:

:0  # forward jokes to my wossamatta u. account
* ^From.*bob
* ^Subject:.*(joke|funny)
! rocky@wossamatta.edu
Three things to note here. First, forwarding mail is done with the '!' at the beginning of the action line, followed by the address. Second, notice that I don't have a colon after 'From' in that condition. This is a quirk of mail headers; there are header From lines with and without colons, so leaving it off is the safest bet. Third, since we're just forwarding the mail and not writing to a file, we don't need a lockfile.
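
When a recipe has more than one condition, all of them have to match. The sketch below imitates that with two grep commands joined by &&; the helper functions and the sample headers are invented for the demonstration, and grep is only standing in for Procmail.

```shell
# is_bob_joke HEADER: succeeds only if BOTH conditions match, as in the recipe.
is_bob_joke() {
    printf '%s\n' "$1" | grep -E -i -q '^From.*bob' &&
    printf '%s\n' "$1" | grep -E -i -q '^Subject:.*(joke|funny)'
}

# classify HEADER: say what the recipe would do with the message.
classify() {
    if is_bob_joke "$1"; then echo "forward"; else echo "keep"; fi
}

joke_from_bob='From: bob@example.com
Subject: Funny thing I found'
memo_from_bob='From: bob@example.com
Subject: Q3 budget memo'
joke_from_alice='From: alice@example.com
Subject: a joke for you'

echo "joke from Bob:   $(classify "$joke_from_bob")"
echo "memo from Bob:   $(classify "$memo_from_bob")"
echo "joke from Alice: $(classify "$joke_from_alice")"
```

Only the first message is forwarded. Bob's memo fails the Subject condition, and Alice's joke fails the From condition, so both stay put.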

Of course, even though I'm sending the joke mail off somewhere else, I'd still like to read the jokes, even if they're not in my mailbox! We could print out those messages, as well as forwarding them; that way we could read them and no one would know...

The new thing here, besides having an action run a program, is that we're going to modify the above recipe so we have two actions. We'll do this with a technique called nesting. Here's the modified recipe:

:0  # forward jokes to my wossamatta u. account
* ^From.*bob
* ^Subject:.*(joke|funny)
{
  :0 c
  ! rocky@wossamatta.edu

  :0
  | lpr -Pacsps
}
Instead of an action line, we're using a nested block, which is enclosed in braces. This block is like a secondary .procmailrc file; in it, we can put any number of recipes, which will only be used if the 'parent' recipe applies.

The first recipe in the block is to send off the mail. It uses a flag in its first line, a 'c'. The 'c' flag means to copy the mail, so that the next recipe also gets a copy of the mail, since ordinarily, mail only goes to the first recipe that fits it. The 'c' flag allows us to apply two recipes to a single message.

We send a message to a program using the vertical bar '|' symbol to start off the action line. This means "send the message as input to the following program." In Unix this is called a "pipe". So, here we're piping the mail message to the program "lpr", which will print the message on the printer "acsps".

In a similar way, let's archive the messages we get from another mailing list, called (let's say) "junk". So, while we deliver the messages to our mailbox, we'll keep the body of the messages in a compressed file, which we could unpack later.

:0 bc:   # archive things sent to junk mailing list
* ^To:.*junk
| gzip >> junk-archive.gz
Here we're using two flags. The 'b' flag means that the action line will just take the body of the message, and not the header. The 'c' flag, again, means to just take a copy of the message for this recipe, and pass it along to the recipes after. We're using that because we want to archive the message, but we'd also like it to be filed in our mail inbox as usual.

The pipe is another Unixism, telling Procmail to send the message to the compression program "gzip", which will squash the text and put it at the end of the file "junk-archive.gz". This file can be uncompressed for later reading with the "gunzip" command, like so:

	gunzip junk-archive.gz
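
If appending to a compressed file looks suspicious, here is a sketch you can run to convince yourself it works. It assumes gzip and gunzip are installed; the body_of helper (which imitates the 'b' flag by dropping everything up to the first blank line) and the two sample messages are made up.

```shell
archive_dir=$(mktemp -d)
archive="$archive_dir/junk-archive.gz"

# body_of: drop the header, i.e. everything up to and including the first blank line.
body_of() {
    sed '1,/^$/d'
}

# Two messages arrive one after the other; each body is compressed and appended.
printf 'To: junk@example.org\nSubject: one\n\nfirst body\n'  | body_of | gzip >> "$archive"
printf 'To: junk@example.org\nSubject: two\n\nsecond body\n' | body_of | gzip >> "$archive"

# -c writes the unpacked text to the screen and leaves the archive in place.
unpacked=$(gunzip -c "$archive")
echo "$unpacked"
```

Both bodies come back out, one after the other. Note that nothing separates them in the archive, since the headers were thrown away.
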
This covers most of the basic recipes that one might create. The limit from here is only your own needs. The manual pages (described below) will be your best course now. The page called procmailrc describes all the flags you can use, and the page called procmailex contains more examples.
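
To pull the pieces together, here is a sketch that writes the assignments and the three tutorial recipes into a single file. It deliberately writes to a temporary file rather than to ~/.procmailrc, so you can look it over first; the file name procmailrc.example is made up, and the comments are on their own lines to keep them away from the conditions.

```shell
rc_dir=$(mktemp -d)
rc_file="$rc_dir/procmailrc.example"

# The quoted 'EOF' keeps the shell from expanding $HOME; Procmail does that itself.
cat > "$rc_file" <<'EOF'
# .procmailrc
# routes incoming mail to appropriate mailboxes
PATH=/bin:/usr/bin:/usr/local/bin
MAILDIR=$HOME/.mailspool
DEFAULT=$HOME/.mailspool/ian
LOGFILE=/dev/null
SHELL=/bin/sh

# file Israeline digests in the 'israel' folder
:0:
* ^To:.*israeline
israel

# forward Bob's jokes and print a copy
:0
* ^From.*bob
* ^Subject:.*(joke|funny)
{
  :0 c
  ! rocky@wossamatta.edu

  :0
  | lpr -Pacsps
}

# archive the body of mail sent to the junk list
:0 bc:
* ^To:.*junk
| gzip >> junk-archive.gz
EOF

# Count the lines that start a recipe, as a quick sanity check.
recipe_count=$(grep -c '^ *:0' "$rc_file")
echo "wrote $rc_file with $recipe_count recipes"
```

Remember to change DEFAULT to your own mailbox, and the addresses and printer name to your own, before copying anything into your real .procmailrc.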

As a sort of quiz, look at the following recipe of mine and try to figure out what it does. I used it as the first recipe in my .procmailrc when I went traveling recently:

:0 Wc: vacation.lock
|/usr/sbin/vacation ian
(hint: look at the manual page for the program 'vacation', and also look at the example in 'procmailex' about sending automatic replies)

What Now?

Not everything you can do with Procmail is explained here. Once you've read this document and practiced a bit, though, you can also refer to the man pages. Man pages are on-line help; you read one by typing
man topic
where topic is usually a command name. Procmail has several man pages which explain aspects of the program:
procmail
The basic description of the program. It discusses options to the procmail program, and has a couple examples at the end.
procmailrc
Detailed description of the format of the .procmailrc file, which controls all the filtering.
procmailex
Several working examples of .procmailrc entries. A very useful resource.
procmailsc
Discusses weight-scoring, a technique for very expert-level filtering.
So, for example, to read the procmailex man page, one would type
umbc9[1]% man procmailex
The regular expressions used by Procmail are the same as those used by the Unix program egrep; these in turn are an extension of the set used by ed, a time-worn editor program. ed's man page is the online bible for regular expressions. egrep's man page discusses the extensions. The procmailrc man page gives a summary.

Procmail is written by Stephen R. van den Berg, at RWTH-Aachen, Germany. The latest version can be found at ftp.informatik.rwth-aachen.de

Also, the comp.mail.misc newsgroup occasionally has traffic on Procmail and mail filtering in general.

(added 9/2003) I get a lot of questions on spam filtering with Procmail. I don't recommend trying to write individual scripts by hand... the spammers are too good, and you'll spend all the time you save writing new Procmail scripts. Instead, you should consider using an external filter, whose output you can process with Procmail. Here are a couple spam-filtering (and generic filtering) packages you can easily use along with Procmail, and which will do a much better job than hand-tuned filtering scripts.

ifile (http://www.nongnu.org/ifile/)
I put this first because it's what I currently use. ifile is a generic email filtering program which can be used to filter out spam, or even sort all your incoming mail. It uses the now-well-known "Naive Bayes" algorithm to learn what features "look like" spam. (When someone in the spam filtering community finally discovers SVMs, and codes them up as efficiently as ifile does, let me know.) There is a link here which describes how to use ifile with Procmail.
SpamAssassin (http://spamassassin.org/)
This is one of the best rule-based spam filters. Their web page tells how to use it with Procmail. If you are at UMBC (which I'm not anymore... if this doesn't work, ask systems@umbc.edu, not me), SpamAssassin is already set up and ready to use. See http://www.csee.umbc.edu/systems/spamassassin.html for local details.
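
To give a feel for what "processing the output of an external filter" looks like, here is a hedged sketch. It assumes a spamassassin command on your PATH that adds an X-Spam-Status header to each message; the folder name 'spam', the temporary file, and the sample header line are all made up, and the package's own documentation is the authority on the recommended recipe.

```shell
spam_rc=$(mktemp)

# Hypothetical fragment: pipe each message through the spamassassin
# command as a filter, then file anything it marked as spam.
cat > "$spam_rc" <<'EOF'
# f = treat the program as a filter, w = wait for it to finish
:0 fw
| spamassassin

# 'spam' is a made-up folder name
:0:
* ^X-Spam-Status: Yes
spam
EOF

# Made-up example of the kind of header line the second recipe looks for.
tagged='X-Spam-Status: Yes, score=7.2 required=5.0'
if printf '%s\n' "$tagged" | grep -E -i -q '^X-Spam-Status: Yes'; then
    verdict="filed in spam"
else
    verdict="delivered normally"
fi
echo "$verdict"
```

The filter does the hard work of deciding what is spam; the Procmail side stays a two-recipe affair that never needs retuning.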

Good luck!


Ian Soboroff -- ian@umbc.edu -- University of Maryland Baltimore County